💾 Prompt Caching
Context Reuse, KV Cache, Inference Optimization, Token Efficiency
Scoured 27,359 posts in 52.7 ms

Prefix caching for LLM inference optimization
🧠 LLM Inference · bentoml.com · 2d · Hacker News

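Prefix caching works because many requests share the same leading tokens (a system prompt, few-shot examples), so the KV entries computed for that shared prefix can be stored and reused instead of recomputed. Below is a minimal block-hash sketch of the idea in Python, loosely in the spirit of vLLM-style automatic prefix caching; the block size, eviction policy, and all names are illustrative assumptions, not BentoML's implementation:

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per cached KV block (illustrative choice)

class PrefixKVCache:
    """Toy block-level prefix cache: maps a hash of the token prefix to an
    opaque KV block, with LRU eviction when the pool is full."""

    def __init__(self, max_blocks=1024):
        self.blocks = OrderedDict()  # prefix hash -> KV block
        self.max_blocks = max_blocks

    @staticmethod
    def _key(prefix_tokens):
        # Hash the whole prefix up to this block, so a block is reused only
        # when everything before it matches too.
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV blocks."""
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = self._key(tokens[:end])
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)  # refresh LRU position
            matched = end
        return matched

    def insert(self, tokens, kv_blocks):
        """Store one KV block per full BLOCK_SIZE-aligned prefix."""
        for i, kv in enumerate(kv_blocks):
            end = (i + 1) * BLOCK_SIZE
            if end > len(tokens):
                break
            self.blocks[self._key(tokens[:end])] = kv
            if len(self.blocks) > self.max_blocks:
                self.blocks.popitem(last=False)  # evict least recently used
```

On a hit, prefill only needs to run on tokens[matched:], which is where the latency and input-token savings come from.
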
From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
🔮 Prefetching · news.future-shock.ai · 5d · Hacker News

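The headline's per-token figures fall out of simple architecture arithmetic: a standard transformer caches one key and one value vector per layer per token, so bytes/token = 2 × n_layers × n_kv_heads × head_dim × dtype_bytes. A quick check under assumed configs (a GQA 70B-class model and an MLA-style model with a compressed per-layer latent; both shapes are my guesses for illustration, not necessarily the article's):

```python
def kv_bytes_per_token_gqa(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V vector per layer per KV head, in FP16/BF16 (2 bytes).
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def kv_bytes_per_token_mla(n_layers, latent_dim, dtype_bytes=2):
    # MLA caches a single compressed latent per layer instead of full K/V.
    return n_layers * latent_dim * dtype_bytes

# Assumed GQA shape (70B-class: 80 layers, 8 KV heads, head_dim 128):
print(kv_bytes_per_token_gqa(80, 8, 128) / 1024)  # 320.0 KB/token
# Assumed MLA shape (61 layers, 576-dim latent including the RoPE part):
print(kv_bytes_per_token_mla(61, 576) / 1024)     # ~68.6 KB/token
```

Those assumed shapes land in the same ballpark as the 300KB and 69KB figures, which is the kind of gap the headline is pointing at.
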
jhammant/Turbo1bit: Turbo1Bit: Combining 1-bit LLM weights (Bonsai) with TurboQuant KV cache compression for maximum inference efficiency. 4.2x KV cache compression + 16x weight compression = ~10x total memory reduction.
🔬 RaBitQ · github.com · 4h · Hacker News

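Worth noting: weight and KV compression ratios don't multiply; the end-to-end reduction is a weighted harmonic mean that depends on how total memory splits between weights and cache. A quick sanity check of the ~10x figure, assuming roughly 80% of memory is weights (the split is my assumption, not stated in the repo description):

```python
def total_reduction(weight_frac, weight_ratio=16.0, kv_ratio=4.2):
    """Overall memory reduction when weights shrink by weight_ratio and the
    KV cache by kv_ratio: a weighted harmonic mean, not a product."""
    kv_frac = 1.0 - weight_frac
    return 1.0 / (weight_frac / weight_ratio + kv_frac / kv_ratio)

print(total_reduction(0.80))  # ~10.2x, consistent with the ~10x headline
```
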
MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
🧠 LLM Inference · arxiv.org · 21h

Autodebug: Telemetry-Driven Inference Optimization Loop
🏗️ LLM Infrastructure · graphsignal.com · 2d · Hacker News

Use Prompt Caching to Reduce Input Tokens with Claude
🔤 Tokenization · pub.towardsai.net · 3d

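For context, Anthropic's Messages API exposes prompt caching through a cache_control marker on a large, stable content block (typically the system prompt); on later requests the cached prefix is read back at a reduced input-token rate. A minimal sketch with the official Python SDK, with placeholder model name and document:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_REFERENCE_DOC = "..."  # placeholder: a large, stable document or prompt

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOC,
            # Marks the prefix up to and including this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)

# usage reports cache writes/reads, e.g. cache_read_input_tokens on a hit.
print(response.usage)
```

Very short prefixes can fall under the API's minimum cacheable length, so this pays off mainly for long, stable system prompts and documents.
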
The architectural trade-offs of AI code generation
💻 Coding Agents · aha.io · 2d · Hacker News

TurboQuant: Building a Sub-Byte KV Cache Quantizer from Paper to Production
🔬 RaBitQ · demo.aitherium.com · 6d · Hacker News

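"Sub-byte" here means each cached key/value scalar is stored in fewer than 8 bits and packed, with a higher-precision scale kept per channel (or per group) for dequantization. A generic 4-bit NumPy sketch of that storage format; this illustrates the packing idea only and is not the TurboQuant algorithm:

```python
import numpy as np

def quantize_kv_4bit(kv):
    """Symmetric per-channel 4-bit quantization of a (tokens, channels) KV
    slice; pairs of rows are packed two 4-bit codes per byte. Assumes an
    even number of rows."""
    scale = np.abs(kv).max(axis=0) / 7.0 + 1e-12          # per-channel scale
    q = (np.clip(np.round(kv / scale), -8, 7) + 8).astype(np.uint8)  # [0,15]
    packed = (q[0::2] << 4) | q[1::2]                     # two codes per byte
    return packed, scale

def dequantize_kv_4bit(packed, scale):
    hi = (packed >> 4).astype(np.int16) - 8
    lo = (packed & 0x0F).astype(np.int16) - 8
    q = np.empty((hi.shape[0] * 2, hi.shape[1]), dtype=np.int16)
    q[0::2], q[1::2] = hi, lo
    return q * scale

kv = np.random.randn(128, 64).astype(np.float32)          # toy KV slice
packed, scale = quantize_kv_4bit(kv)
err = np.abs(dequantize_kv_4bit(packed, scale) - kv).mean()
print(packed.nbytes / kv.nbytes, err)                     # 0.125 of FP32 size
```

Real quantizers must also budget for the stored scales and typically work per head and per group, which is how fractional averages like 3.5 bits per value arise.
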
Tq-KV – Rust implementation of TurboQuant that works on GGUF models
🎯 Qdrant · news.ycombinator.com · 1d · Hacker News

What if AI doesn’t need more RAM but better math?
🧠 LLM Inference · adlrocha.substack.com · 4d · Substack

Pure C implementation of the TurboQuant paper (ICLR 2026) for KV cache compression in LLM inference.
🗜️ Vector Compression · github.com · 1d · r/LocalLLaMA

What I Found Interesting in Claude Code's Source
✏️ Code Editors · shloked.com · 3d

TurboQuant: KV Cache Quantization to 3.5 Bits with Zero Accuracy Loss (ICLR 2026)
🗜️ Vector Compression · darshanfofadiya.com · 4d · Hacker News

Memoriant/dgx-spark-kv-cache-benchmark: KV cache quantization benchmarks on NVIDIA DGX Spark GB10 — three novel findings including the dequantization cliff and unified memory paradox
🔮 Prefetching · github.com · 2d · Hacker News

Will Google's TurboQuant AI Compression Finally Demolish the AI Memory Wall?
📱 Edge AI Optimization · buysellram.com · 6d · Hacker News

KVSculpt: KV Cache Compression as Distillation
🗜️ Vector Compression · arxiv.org · 2d

SharpAI/SwiftLM: ⚡ Native Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, + iOS iPhone app.
🏗️ LLM Infrastructure · github.com · 1d · Hacker News

The KV Cache. Every LLM Running Today Is Built Around One Number Staying Still.
💾 CPU Caching · pub.towardsai.net · 4d

ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
🧠 LLM Inference · arxiv.org · 2d

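The "layer-ahead" phrasing suggests classic double buffering: while attention for layer i runs, the offloaded KV for layer i+1 is already being copied back from CPU memory. A minimal overlap sketch with a background thread; fetch_kv and attend are hypothetical stand-ins, not the paper's API:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_kv(layer):
    """Hypothetical stand-in: copy one layer's offloaded KV from CPU to GPU."""
    ...

def attend(layer, kv):
    """Hypothetical stand-in: run attention for one layer."""
    ...

def decode_step(n_layers):
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_kv = pool.submit(fetch_kv, 0)  # prefetch layer 0's KV
        for layer in range(n_layers):
            kv = next_kv.result()           # wait for this layer's KV
            if layer + 1 < n_layers:        # start copying the next layer
                next_kv = pool.submit(fetch_kv, layer + 1)
            attend(layer, kv)               # compute while the copy proceeds
```
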
KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study
🔬 RaBitQ · arxiv.org · 2d